1 Overview & Motivation

We are facing an unprecedented public health crisis with the coronavirus (COVID-19) outbreak. Ever since COVID-19 erupted into our world, research institutes and governments have publicly released numerous databases to allow research groups and independent individuals to analyze the data around the virus’s spread. We believe that data-driven decisions, and people working together for the greater good, are among the better ways to get through this difficult time.

In this blog, we are interested in the question ‘How is the world’s news media covering the COVID-19 pandemic?’. Building on its massive television news narratives dataset, GDELT released a powerful news dataset containing the URLs, titles, publication dates and brief snippets of more than 1.1 million worldwide English-language online news articles mentioning the virus, enabling researchers and journalists to understand the global context of how the outbreak has been covered since November 2019. The dataset has been expanding daily and spans a number of related topics.

A single article on COVID-19 can cover various topics, such as health, the business implications of the disease, or climate change, or it could just be a front to propagate fake information. Given the huge number of news articles floating around the web in the wake of COVID-19, it is very difficult to compile and compare them. To analyse what is being discussed during these difficult times, we would first have to collect all the news articles and then annotate them according to their implicit news sub-categories. This motivates us to create an approach that annotates news articles on the coronavirus without any manual intervention. With such a pipeline we aim not only to give researchers, media persons and journalists access to similar articles, but also to avoid the overhead of reading and understanding unrelated ones, thereby improving the quality of the groups of similar articles and the topics representing them.

We address the huge flow of information, so-called “information overload”, which makes it harder for users to find related information on COVID-19 on the internet. We do so with an application that lets users find news matching their query or interests effortlessly. We foresee several challenges: determining the subtopic of each article, extracting only the main content of each webpage, and presenting the data to the user. In real-world applications, multi-label classification (MLC), in which objects can be identified by more than one label, has a lot of utility, but manually labelling a dataset is costly and tedious. An unsupervised learning approach should therefore be considered: cluster similar articles first, then apply topic modelling to assign multiple labels to each cluster. We use an unsupervised learning technique (clustering) to group the collection of articles so that articles in the same group are more similar to each other than to those in other groups; the discovered structure can then be used to help classify the articles.

We analyse this large set of news articles to make it easier for ordinary readers to filter through the many articles related to the virus and draw their own conclusions. Furthermore, we want to understand the semantic relations between different topics, and finally analyse keywords to uncover patterns in the news content.

3 Research Questions

Can we find articles with topics similar to those of a given article?
In order to answer this question, we need to answer the following research questions:
1. What is the most dominant topic in the article?
2. How do we determine the value of K that is best suited and most interpretable for topic modelling on our dataset?
3. How does the topic model perform with different features, namely term frequency–inverse document frequency (TF-IDF) on top of bag of words, versus bag of words with plain term frequency (TF)?

4 Dataset

5 Exploratory Analysis

5.1 Overview of the Dataset with plots :

5.1.1 Distribution of Articles :

The news articles were distributed over the categories below. These categories are simply the keywords that the GDELT project used to collect the articles. Although we do not intend to use these labels assigned to each article, we took a fair distribution of articles from every set in order to avoid biased results.
...

5.1.2 Pre-processing :

Our dataset is in text format, so we pre-processed it before performing any kind of exploratory analysis. This was required to clean it and remove unnecessary words or characters that could skew our analysis. Pre-processing is one of the most important steps in Natural Language Processing: well pre-processed data speeds up the computation required for further analysis, and the quality of the resulting tokens tends to be higher than with poorly pre-processed data.

Steps taken for pre-processing:
* Removed URLs from the content
* Replaced punctuation, numbers and any characters other than letters
* Converted Latin-encoded text to UTF-8
* Converted the text to lower case
* Removed stop words
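The steps above can be sketched as a small R function. This is a hedged reconstruction, not our exact implementation: it assumes the raw article bodies sit in a character vector and uses the tm package’s English stop word list.

```r
library(tm)

# Hedged sketch of the pre-processing pipeline; `texts` is assumed to be a
# character vector of raw article bodies.
clean_text <- function(texts) {
  texts <- iconv(texts, from = "latin1", to = "UTF-8")   # Latin-1 -> UTF-8
  texts <- gsub("http\\S+|www\\.\\S+", " ", texts)       # remove URLs
  texts <- gsub("[^A-Za-z]", " ", texts)                 # keep letters only
  texts <- tolower(texts)                                # lower case
  texts <- removeWords(texts, stopwords("english"))      # drop stop words
  stripWhitespace(trimws(texts))                         # tidy spacing
}

clean_text("Visit https://example.com for THE latest 2020 updates!")
```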

5.1.3 Wordcloud :

Wordclouds are a visual summary of the underlying words in any text, or of the news articles dataset in our case. We generated wordclouds for two weighting schemes of the bag-of-words model, term frequency (TF) and TF-IDF, because we wanted to analyse how the size and choice of the displayed words differ when the weighting scheme changes for the same corpus.

5.1.3.1 TF :

As we can see in the wordcloud below, the news articles have been all about the coronavirus pandemic. Terms with higher frequencies are drawn larger. Since the weighting used here is raw term frequency, the most common pandemic-related terms dominate the cloud.

wordcloud2(d_bow, shape = "star", size = 0.4)
...

5.1.3.2 TF-IDF :

The wordcloud below uses the bag-of-words model with the TF-IDF weighting scheme, which down-weights terms that appear in almost every article.

wordcloud2(d_tfidf, shape = "star", size = 0.15)
...
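The two weighting schemes can be reproduced with tm. The following is a hedged sketch using a toy two-document corpus in place of the real articles; `d_bow` and `d_tfidf` are the word-frequency tables fed to wordcloud2() above.

```r
library(tm)

# Toy corpus standing in for the cleaned articles (assumption).
corpus <- VCorpus(VectorSource(c("virus spread news virus",
                                 "economy news lockdown")))

dtm       <- DocumentTermMatrix(corpus)  # raw term frequency (TF)
dtm_tfidf <- weightTfIdf(dtm)            # TF-IDF weighting

# Word-frequency tables in the shape wordcloud2() expects.
d_bow   <- data.frame(word = colnames(dtm),
                      freq = slam::col_sums(dtm))
d_tfidf <- data.frame(word = colnames(dtm_tfidf),
                      freq = slam::col_sums(dtm_tfidf))
```

Note that a term like “news”, which appears in every document, gets a TF-IDF weight of zero; this is why the two wordclouds emphasise different terms.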

6 Final Analysis

6.2 Topic Modelling using LDA for Prediction :

6.2.1 Bag of Words with Term Frequency as Weighting Scheme :

6.2.1.1 Parallel Co-ordinates plot for Gibbs Sampling as Model 1 :

#Plot
ggparcoord(bow_test_train_model1,
           columns = 1:25, groupColumn = 26,
           scale = 'uniminmax',
           showPoints = TRUE,
           title = "Parallel Coordinate Plot For Model 1 with BOW as FE",
           alphaLines = 0.1
) + scale_color_viridis(discrete=TRUE) +
  theme_ipsum()+
  theme(plot.title = element_text(size=8))
Plot 1 for Gibbs Sampling as Model 1


6.2.1.2 PC plot for Dot Product as Model 2 :

#Plot
ggparcoord(bow_test_train_model2,
           columns = 1:25, groupColumn = 26,
           scale = 'uniminmax',
           showPoints = TRUE,
           title = "Parallel Coordinate Plot For Model 2 with BOW as FE",
           alphaLines = 0.1
) + scale_color_viridis(discrete=TRUE) +
  theme_ipsum()+
  theme(plot.title = element_text(size=8))
Plot 2 for Dot Product as Model 2


6.2.2 Model Evaluation :

The following metric was used to evaluate the models.

6.2.2.1 Likelihood :

Likelihood Score
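The likelihood score can be read directly off a fitted topicmodels object. Below is a minimal sketch on a toy corpus; the real comparison uses our full document-term matrix and both fitting methods.

```r
library(tm)
library(topicmodels)

# Toy corpus standing in for the cleaned articles (assumption).
docs <- c("virus spread infection virus hospital",
          "market economy stocks market trade",
          "virus infection hospital doctor",
          "economy market trade stocks")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Fit LDA with Gibbs sampling and read off the model log-likelihood;
# higher (less negative) indicates a better fit.
lda_gibbs <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 42))
logLik(lda_gibbs)
```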

6.3 Visualizations :

The first step was to reduce the dimensionality using t-SNE, from a feature set of thousands of columns (one per word) in the BOW matrix with the TF weighting scheme down to two dimensions.
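A hedged sketch of that reduction with the Rtsne package, using a random matrix in place of the real BOW features; `X` is the 2-D embedding consumed by the clustering step.

```r
library(Rtsne)

set.seed(42)
# Random stand-in for the high-dimensional BOW/TF feature matrix (assumption):
# 100 documents x 50 word columns.
X_full <- matrix(runif(100 * 50), nrow = 100, ncol = 50)

# Reduce to 2 dimensions; perplexity must satisfy 3 * perplexity < n - 1.
tsne_out <- Rtsne(X_full, dims = 2, perplexity = 30, check_duplicates = FALSE)
X <- tsne_out$Y
dim(X)  # 100 rows, 2 columns
```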

6.3.2 Clustering :


6.3.2.0.1 Convex Hull Plot :
k3 <- kmeans(X, centers = 8, nstart = 5, iter.max = 100000L)

fviz_cluster(k3,X)
Convex Hull Plot for 8 clusters


6.3.2.0.2 Evaluation Using Silhouette Coefficient for 8 clusters :
Silhouette Coefficient Plot 1


6.3.2.0.3 Evaluation Using Silhouette Coefficient for 15 clusters :
Silhouette Coefficient Plot 2

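The silhouette coefficients shown above can be computed with the cluster package. This is a minimal sketch on random 2-D points standing in for the t-SNE embedding.

```r
library(cluster)

set.seed(42)
# Random 2-D points standing in for the t-SNE embedding (assumption).
X <- matrix(rnorm(200), nrow = 100, ncol = 2)

km  <- kmeans(X, centers = 8, nstart = 5)
sil <- silhouette(km$cluster, dist(X))

# Average silhouette width; values near 1 mean well-separated clusters,
# values near 0 or below mean overlapping clusters.
mean(sil[, "sil_width"])
```

factoextra’s fviz_silhouette(sil) renders the same per-observation plot shown above.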

6.3.2.0.4 Topic to Word Occurrence :

Sankey Network Diagram

links <- data.frame(
  source = top_terms$topic,
  target = top_terms$term,
  value = top_terms$beta
)

nodes <- data.frame(
  name=c(as.character(links$source), 
         as.character(links$target)) %>% unique()
)

# networkD3 expects zero-based numeric ids rather than names, so map each
# name in the links dataframe to its index in `nodes`.
links$IDsource <- match(links$source, nodes$name) - 1
links$IDtarget <- match(links$target, nodes$name) - 1
# Make the Network
p <- sankeyNetwork(Links = links, Nodes = nodes,
                   Source = "IDsource", Target = "IDtarget",
                   Value = "value", NodeID = "name", 
                   colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);"),
                   sinksRight=FALSE,fontSize = 16,height = 1400,width = 1200,
                   nodePadding = 8, fontFamily = "arial",unit = "Letter(s)")
p
6.3.2.0.5 Topics associated to each Cluster :

Chord Diagram

chordDiagram(new_v, big.gap = 10, directional = 1,
             direction.type = c("diffHeight", "arrows"),
             link.arr.type = "big.arrow", diffHeight = -mm_h(1),
             grid.col = c("violet", "blue4", "blue", "green", "yellow",
                          "tomato", "red", "cyan4", "deeppink", "cyan3",
                          "chocolate4", "darkslategrey", "darksalmon",
                          "chartreuse", "darkorchid2", "deepskyblue1",
                          "lightcoral", "palegreen4", "paleturquoise2",
                          "palevioletred", "peru", "pink4", "purple2",
                          "sienna1", "skyblue2", "seagreen2", "rosybrown",
                          "plum3", "slateblue2", "orange3", "darkgoldenrod2",
                          "salmon2", "pink2"))
Chord Diagram for Topic to Cluster association


6.3.2.0.6 Probability Distribution of Topics in each Cluster :
bp <- ggplot(temp, aes(x= Topic_Number,
      y=Topic_Probability, group = 1)) + 
      geom_line(color = "steelblue",size = 2) + 
      geom_point(size = 2) +
      labs(title = "Topic Distribution in each Cluster",
      y = "Average Probability", x = "Topic Number")

bp +facet_grid(Cluster_Number ~ .)
Probability Distribution of Topics in each cluster


6.3.2.0.7 LDA Vis :
Terms in Topics


6.3.2.0.8 Rbokeh plot :

figure(title = "Rbokeh plot representing Documents and topics", width = 1200, height = 600) %>%
  
  ly_points(x = X1, y = X2,data = bow_test_train_model1,color = type,
            hover=c(topic_highest_prob,title,url,keywords),size = 3
  ) %>%
  
  set_palette(discrete_color = pal_color(c( "red", "skyblue")))
figure(title = "Rbokeh plot representing Documents and topics", width = 1200, height = 600) %>%
  
  ly_points(x = X1, y = X2,data = bow_test_train_model1,color = topic_highest_prob,
            hover=c(topic_highest_prob,title,url),size = 5) 

6.3.2.0.9 Documents Similarity :
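One hedged way to carry out this step, assuming the per-document topic distributions (the gamma matrix from the fitted LDA): rank articles by the cosine similarity of their topic vectors. A toy 3-article example:

```r
# Toy gamma matrix: 3 documents x 4 topics (assumption; the real matrix
# comes from the fitted LDA model).
gamma <- rbind(c(0.70, 0.10, 0.10, 0.10),
               c(0.65, 0.15, 0.10, 0.10),
               c(0.05, 0.05, 0.10, 0.80))

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity of every article to article 1; sorting puts the most
# topically similar articles first.
sims <- apply(gamma, 1, cosine_sim, b = gamma[1, ])
order(sims, decreasing = TRUE)
```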

7 Conclusion


8 Future Work

9 References

10 Team members

Name Email-Id Matr. No.
Calida Pereira calida.pereira@st.ovgu.de 229945
Chandan Radhakrishna chandan.radhakrishna@st.ovgu.de 229746
Nandish Bandi Subbarayappa nandish.bandi@st.ovgu.de 229591
Mohit Jaripatke mohit.jaripatke@st.ovgu.de 224651
Priyanka Bhargava priyanka.bhargava@st.ovgu.de 229675